Implement a Python PEG parser in Rust #566

zsol · 2021-12-17T16:29:28Z

Summary

This massive PR implements an alternative Python parser that will allow LibCST to parse Python 3.10's new grammar features. The parser is implemented in Rust, but it's turned off by default through the LIBCST_PARSER_TYPE environment variable. Set it to native to enable. The PR also enables new CI steps that test just the Rust parser, as well as steps that produce binary wheels for a variety of CPython versions and platforms.

Note: this PR aims to be roughly feature-equivalent to the main branch, so it doesn't include new 3.10 syntax features. That will be addressed as a follow-up PR.

The new parser is implemented in the native/ directory, and is organized into two rust crates: libcst_derive contains some macros to facilitate various features of CST nodes, and libcst contains the parser itself (including the Python grammar), a tokenizer implementation by @bgw, and a very basic representation of CST nodes. Parsing is done by

tokenizing the input utf-8 string (bytes are not supported at the Rust layer, they are converted to utf-8 strings by the python wrapper)
running the PEG parser on the tokenized input, which also captures certain anchor tokens in the resulting syntax tree
using the anchor tokens to inflate the syntax tree into a proper CST

These steps are wrapped into a high-level parse_module API here, along with parse_statement and parse_expression functions which all just accept the input string and an optional encoding.

These Rust functions are exposed to Python here using the excellent PyO3 library, plus an IntoPy trait which is mostly implemented via a macro in libcst_derive.

Test Plan

Apart from Rust unit tests and roundtrip tests, the new implementation is also tested by running the pure python tests on them (by setting LIBCST_PARSER_TYPE=native before running python -m unittest). All of these are part of CI steps.

Known missing features

Apart from the soon-to-come 3.10 syntactic features mentioned above, here's an incomplete list of the features the Rust implementation lacks compared to the pure Python parser:

No support for limiting the parser to a certain Python version. The parser will accept input that's valid in any of Python 3.9, 3.8, 3.7, or 3.6 at the moment.
The native parser API only accepts utf-8 encoded strings. Support for bytes is provided in the wrapping python layer.
In some places the old parser returned a tuple of CSTNodes, the new parser returns a list of them (see adjusted unit tests)

This adds an experimental new native extension, with the goal of improving performance, currently just reimplementing `whitespace_parser`. This appears to make parsing marginally faster overall, but the performance gains are pretty limited because whitespace parsing is a small portion of the overall parsing process, and there's some added overhead of converting types (mostly unicode strings) between rust and python. Instead, this helps show what's needed to port part of the codebase over to rust in an example that's small enough to be digestible, but big enough to be more than a trivial toy. I originally started by trying to port the tokenizer to rust, since it takes a larger portion of time and it's infrequently modified (making it a good candidate), but found issues with the implementation direction I was taking, so I scrapped that for now.

Ports CPython's tokenize.c to rust, and adds features to it to (mostly) match the behavior of the Parso-based tokenizer. Some tests are currently failing, but most of this is just due to differences between error handing in the Python in Rust implementations. I'll either fix or disable these cases in the future. Like the whitespace_parser commit, this appears to make parsing faster by a small margin, but tokenizing is just one small part of the overall parsing process.

There are also tests and a simple benchmark.

…mplemented)

bgw · 2021-12-17T21:05:48Z

Oh my god. The mad lad, @zsol actually did it. Congratulations!

I don't think I can reasonably read through all this code, but I might look through it a bit later. None of that should prevent you from merging though.

zsol · 2021-12-17T21:57:18Z

Thanks! I'd still appreciate any review or thoughts for even parts of the code, because I'm 100% sure there are some questionable decisions in there that would benefit from ... questioning :) Can incorporate any changes in future PRs.

stroxler · 2021-12-18T17:40:27Z

Wow, there's a lot here! I'll also try to read through parts of it.

My view is that if tests pass we can probably go ahead and merge it even if there are questionable bits as long as the CST data itself is reasonable, nothing prevents us from revisiting the code later and getting 3.10 PEG support landed is pretty time-sensitive.

Before I try to dive in, a few high level questions:

(1) are there any particular parts of the code I should focus on for review? I'm coming off of making some changes to the cpython parser for PEP 677 so I can try to focus on that, but there's a big difference between making one change, even a nontrivial one, versus trying to take in the whole thing at once!

(2) are there any changes you'd like to make on top of this? It might help me build context to not just read the code but make a small modification or two if there's something you already have in mind.

(3) for code that can parse using either the old or new parser, is the produced CST the same? My impression is yes given your description but just want to double-check. In my mind even if there are messy bits, if this is true we should err on the side of merging fast, especially since it's off by default.

zsol · 2021-12-19T10:43:34Z

are there any particular parts of the code I should focus on for review? I'm coming off of making some changes to the cpython parser for PEP 677 so I can try to focus on that, but there's a big difference between making one change, even a nontrivial one, versus trying to take in the whole thing at once!

In that case, I'd recommend grammar.rs which implements step 2 from the PR description. Specifically I'm a bit worried about accepting syntax that shouldn't be accepted.

are there any changes you'd like to make on top of this? It might help me build context to not just read the code but make a small modification or two if there's something you already have in mind.

Yeah, I'd like to add match statements myself, but you could try extending the grammar to support parenthesized context managers (https://bugs.python.org/issue12782)

for code that can parse using either the old or new parser, is the produced CST the same?

Yes! The only case where this isn't true is the tuple-vs-list problem I mention in the PR description. PyO3 automatically converts rust's List<_> into python List[...], but in some cases the pure python parser produces a tuple.

libcst/_nodes/tests/test_import.py

thatch · 2021-12-20T01:51:27Z

libcst/_nodes/tests/test_atom.py

@@ -1120,6 +1121,8 @@ def test_invalid(self, **kwargs: Any) -> None:
        )
    )
    def test_versions(self, **kwargs: Any) -> None:
+        if is_native() and not kwargs.get("expect_success", True):
+            self.skipTest("parse errors are disabled for native parser")


Do you have a plan for making them work? This is an important part of the API.

Yeah it's possible to make it work, but my priority is to get 3.10 working first. The old parser won't be removed until this feature is supported by the new one.

thatch · 2021-12-20T01:56:24Z

libcst/_parser/detect_config.py

@@ -114,6 +114,23 @@ def _detect_future_imports(tokens: Iterable[Token]) -> FrozenSet[str]:
    return frozenset(future_imports)


I'm not 100% sure this has good tests. Does this avoid having to tokenize twice?

Hmm, I don't think this code is ever getting run in native mode. Will probably have to be reimplemented on the rust side in the future. As far as I can tell this is only being used to figure out the exact grammar version to be used.

thatch · 2021-12-20T02:02:21Z

Can you expand more on

No support for limiting the parser to a certain Python version

Are you intending on adding that before a non-pre release? I put a decent chunk of effort in supporting old versions and even have a draft going back to 2.3 that I'd be happy to rework if version-dependent support existed. See #275.

stroxler · 2021-12-20T23:41:16Z

@thatch my impression of what @zsol was saying about python versions is just that the parser won't throw errors if you use Python 3.9 syntax in a 3.6 codebase.

In my mind this isn't a huge problem, generally there are other ways to get pretty fast feedback about syntax errors, and it's not critical that LibCST deal with it... if there's an exception to that I'd say it's actually on the output of codemods, not the input, but presumably that's not the parser anyway

bgw · 2021-12-21T00:44:00Z

the parser won't throw errors if you use Python 3.9 syntax in a 3.6 codebase

How does this work for bits of Python grammar that were changed in backwards-incompatible ways? In older versions of Python, it was valid to use async and await as variable names. In 3.7+, it's no longer valid. However, if you enable the async/await tokenizer hacks on 3.7+ code, you might run into issues with async generator expressions (https://bugs.python.org/issue31708).

stroxler

I read the grammar itself (not the entire grammar.rs file, just up to line 1180 or so) end-to-end and compared to the main grammar at
https://docs.python.org/3.11/reference/grammar.html
as well as, where necessary, the full cpython PEG at
https://github.com/python/cpython/blob/main/Grammar/python.gram

Overall it looks good to me, especially in light of having test coverage.

I flagged a number of things:

[substantive] I think I only found 2 or 3 things that might be wrong, and
none of them seem like they ought to block merge of an off-by-default
new parser - I'd say address them if you want or we can merge and follow
up later
[nit / curiousity] a few places the rules seemed awkward, or slightly out of
sync with the python official grammar in ways that appeared unnecessary.
I definitely may be missing something though, marked these as [nit], [curiosity]
or [both] and it's probably fine to merge without addressing (although if there
are reasons I'd appreciate an explanation since I'd learn something!)
[compat] I've marked rules that are not in a great order or otherwise have
cosmetic issues relative to the official grammar as [compat]. Feel free to
ignore these, they shouldn't block land but I do think I might follow up
later with a reorder because that would make checking them against
one another easier going forward

libcst/_parser/entrypoints.py

native/libcst/src/parser/grammar.rs

stroxler · 2021-12-21T00:47:13Z

native/libcst/src/parser/grammar.rs

+                make_import_alias(n, asname)
+            }
+
+        rule dotted_name() -> NameOrAttribute<'a>


[compat] this rule diverges pretty heavily from the latest grammar which uses a recursive rule instead.

I'm pretty certain they are equivalent

I forget why this was necessary. Maybe it wasn't? Not sure.

stroxler · 2021-12-21T03:04:31Z

How does this work for bits of Python grammar that were changed in backwards-incompatible ways?

Not sure - I was mostly describing my understanding of @zsol's comment about supporting all versions, but it does appear to me that the new parser is using async / await as dedicated tokens.

I wonder if a solution is to only use the new parser, when we roll it out as default, only for 3.7+? The main motivation to move to the PEG is to properly support newer python versions. If we want to support 3.6 properly with the native parser we could always put the time into that later.

amyreese · 2021-12-21T22:54:47Z

No support for limiting the parser to a certain Python version. The parser will accept input that's valid in any of Python 3.9, 3.8, 3.7, or 3.6 at the moment.

@zsol can you confirm if you intend to add support for this functionality before the Rust parser becomes enabled by default? ie, have separate grammar for each major release of CPython, and ensure strict parsing when given a target version?

There is a lot of work that Tim had in-flight for supporting older releases of CPython, potentially back to 2.x versions, that I don't want to lose, as well as the functionality in guaranteeing correctness and safety when parsing/modifying/dumping trees for production code. Eg, when writing codemods, we want to know that if we ask to parse with 3.7 code, that we're not getting trees/nodes/grammar that are only available in 3.8 or 3.9, and vice versa.

zsol · 2021-12-22T08:27:03Z

Yeah this feature won't be lost one way or another. Having said that, I don't think supporting python 2 grammars is very valuable at this point, as I commented on Tim's issue recently

…

On Tue, 21 Dec 2021, 22:54 John Reese, ***@***.***> wrote: No support for limiting the parser to a certain Python version. The parser will accept input that's valid in any of Python 3.9, 3.8, 3.7, or 3.6 at the moment. @zsol <https://github.com/zsol> can you confirm if you intend to add support for this functionality before the Rust parser becomes enabled by default? ie, have separate grammar for each major release of CPython, and ensure strict parsing when given a target version? There is a lot of work that Tim had in-flight for supporting older releases of CPython, potentially back to 2.x versions, that I don't want to lose, as well as the functionality in guaranteeing correctness and safety when parsing/modifying/dumping trees for production code. Eg, when writing codemods, we want to know that if we ask to parse with 3.7 code, that we're not getting trees/nodes/grammar that are only available in 3.8 or 3.9, and vice versa. — Reply to this email directly, view it on GitHub <#566 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAQJNDGDOLXUVSGWPW4IDLUSEAUHANCNFSM5KJHLMVA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were mentioned.Message ID: ***@***.***>

bgw and others added 30 commits December 7, 2021 10:53

Basic FunctionDefinition parsing and codegen

f0d3004

There are also tests and a simple benchmark.

add license headers

44e785a

Convert WhitespaceState to not own absolute indent and move it to parser

f471daf

remove python wrappers for now

e26a9c3

expressions

68516a6

tracing

728d856

nicer error messages

275774d

bump peg version

1b2b8ea

prettier error messages

b32f953

newline handling

3ed1b40

blocks

e198d01

nicer error messages

d195023

remove unused dep

f83cf4f

only show part of input in error msg

f5fff7c

keep newlines when splitting by line

222e89e

if statements

f02c2bf

parameters

4414c5f

add func def fixture

0c040dc

more whitespace to function def

a568392

tweak benchmarks

07812f1

improve codegen perf

dff70a6

use cpu cycles for benchmarks

f83355b

imports

f1a363e

add test fixture

0180d33

add parentheses to imports

d4de127

eliminate warnings

7ecd5ae

comparisons (note tests are failing because operator parsing is not i…

9c06f52

…mplemented)

operators

016809d

zsol added 5 commits December 16, 2021 16:54

update codegen and formatting

6925202

run unittests with rust too

27cd771

use cibw to build wheels

e430bbd

fix codemod cli test in native/3.6 mode

e363058

fix rust unittests on macos

30efa4e

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 17, 2021

zsol mentioned this pull request Dec 17, 2021

3.10+ support #285

Closed

kamahen mentioned this pull request Dec 18, 2021

A crash in Python 3.5: lib2to3.pgen2.parse.ParseError: bad input: type=16, value='*' google/yapf#825

Closed

thatch reviewed Dec 20, 2021

View reviewed changes

libcst/_nodes/tests/test_import.py Outdated Show resolved Hide resolved

thatch reviewed Dec 20, 2021

View reviewed changes

revert accidental test change

bfa285f

zsol mentioned this pull request Dec 20, 2021

ParserSyntaxError on from foo import (bar \n as baz) #530

Open

stroxler approved these changes Dec 21, 2021

View reviewed changes

zsol added 2 commits December 21, 2021 17:43

address review comments & reorder grammar rules to match CPython

b188f42

silence lints

11cd13b

zsol merged commit c02de9b into Instagram:main Dec 21, 2021

zsol mentioned this pull request Dec 21, 2021

#[cache_left_rec] not work in peg v0.7.0 kevinmehall/rust-peg#278

Closed

stroxler mentioned this pull request Dec 22, 2021

[native] Add a rust implementation of whitespace_parser #452

Closed

zsol mentioned this pull request Aug 11, 2022

Add support for limiting the native parser to a certain Python version #740

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a Python PEG parser in Rust #566

Implement a Python PEG parser in Rust #566

zsol commented Dec 17, 2021

bgw commented Dec 17, 2021

zsol commented Dec 17, 2021

stroxler commented Dec 18, 2021

zsol commented Dec 19, 2021

thatch Dec 20, 2021

zsol Dec 20, 2021

thatch Dec 20, 2021

zsol Dec 20, 2021

thatch commented Dec 20, 2021

stroxler commented Dec 20, 2021

bgw commented Dec 21, 2021

stroxler left a comment

stroxler Dec 21, 2021

zsol Dec 21, 2021

stroxler commented Dec 21, 2021

amyreese commented Dec 21, 2021

zsol commented Dec 22, 2021 via email

		@@ -114,6 +114,23 @@ def _detect_future_imports(tokens: Iterable[Token]) -> FrozenSet[str]:
		return frozenset(future_imports)

Implement a Python PEG parser in Rust #566

Implement a Python PEG parser in Rust #566

Conversation

zsol commented Dec 17, 2021

Summary

Test Plan

Known missing features

bgw commented Dec 17, 2021

zsol commented Dec 17, 2021

stroxler commented Dec 18, 2021

zsol commented Dec 19, 2021

thatch Dec 20, 2021

Choose a reason for hiding this comment

zsol Dec 20, 2021

Choose a reason for hiding this comment

thatch Dec 20, 2021

Choose a reason for hiding this comment

zsol Dec 20, 2021

Choose a reason for hiding this comment

thatch commented Dec 20, 2021

stroxler commented Dec 20, 2021

bgw commented Dec 21, 2021

stroxler left a comment

Choose a reason for hiding this comment

stroxler Dec 21, 2021

Choose a reason for hiding this comment

zsol Dec 21, 2021

Choose a reason for hiding this comment

stroxler commented Dec 21, 2021

amyreese commented Dec 21, 2021

zsol commented Dec 22, 2021 via email